Feature/issue 3311 test thread tbb exp #3314
drezap wants to merge 24 commits into stan-dev:develop from drezap:feature/issue-3311-test-thread-tbb-exp
Conversation
parallel_for, blocked range compiles for stan::math::exp
compiling blocked_range works fine
some progress, now a type deduction issue?
ok something closer...
implement struct version for parallel_for... uncompiled
begin new class to use parallel for
almost compiles... getting close, have template deduction failed which we can figure out
almost compiles
hold on
compiles
remove dead code
compiled parallel_for, blocked_range for stan::math::exp
compiled parallel_for, blocked_range for stan::math::exp
Hold on, sorry, I should rebase. I have some questions; wondering if anyone has comments, or is this all on me? The refactor, and using threads at a lower number of observations.
…rezap/math into feature/issue-3311-test-thread-tbb-exp
Do you have a graph that shows the speedup? Overall I'd be kind of cautious introducing lower level threading like this. Like you saw, whether you get a speedup or slowdown depends a lot on the number of observations. So for every vector operation we would have to have a check that the size exceeded some threshold. That threshold is going to vary a lot per computer, and I think if we are not careful it could make the codebase kind of funky.

The other piece here is that this works for `prim` functions of `double` type, but parallelism is much harder for reverse mode, which is the main piece of the math library we worry about. The main issue is handling how the global AD tape should sync when we have jobs across N threads. @andrjohns thought for a long while trying to figure out how to do a nice parallel `map(...)` style function for reverse mode autodiff. I'm not sure he came up with something he found satisfying. I have not either, honestly. Essentially you need to shard the operation over N shards which will have N autodiff stacks, then once the parallel computation is done you have to pass those autodiff stacks back and put them onto the main thread's stack. So there you would get performance benefits for setting up the forward pass in parallel, but the reverse pass would still be serial and you pay the cost of the sharding and thread startup. I'm very certain there is a way to do it so you can do the forward and reverse pass in parallel, but nothing has ever come to me for this problem.
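For concreteness, the shard-and-merge pattern described above might look roughly like the following sketch. The `tape` type and everything else here are illustrative stand-ins, not Stan Math's actual AD machinery:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for an autodiff tape: just a list of recorded "nodes".
// This whole sketch is illustrative pseudocode for the shard-and-merge
// idea, not Stan Math's real AD stack.
struct tape {
  std::vector<int> nodes;
};

void sharded_forward_pass(tape& main_tape, std::size_t n_shards) {
  std::vector<tape> shard_tapes(n_shards);
  std::vector<std::thread> workers;
  for (std::size_t k = 0; k < n_shards; ++k)
    workers.emplace_back([&shard_tapes, k] {
      // Each shard records onto its own tape, so threads never touch
      // the global tape during the parallel forward pass.
      shard_tapes[k].nodes.push_back(static_cast<int>(k));
    });
  for (auto& w : workers) w.join();
  // Merge step: splice each shard's tape back onto the main thread's
  // tape; the reverse pass then walks the combined tape serially, which
  // is exactly the cost described above.
  for (auto& s : shard_tapes)
    main_tape.nodes.insert(main_tape.nodes.end(), s.nodes.begin(),
                           s.nodes.end());
}
```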
I'm thinking about it; I haven't thought too far ahead yet. Thank you.
I'm running the continuous integration tests; it looks like they're mostly passing now.
And I need to consider threading the rev autodiff stack. That would be cool if different threads could build different expression trees; I think that's what Steve was saying. But if this adds an incremental speed increase, why not? WRT Steve's comment, I can think about it, but here I'm not parallelizing anything on the stack, just the evaluation of the computation of `exp`.
Jenkins Console Log. Machine information: Ubuntu 20.04.3 LTS (focal).
Not sure why Jenkins emailed me SUCCESS when there are so many errors? I'm not seeing these locally. I also named the branch wrong, but I'll just leave it until it's closed...
Jenkins Console Log. Machine information: Ubuntu 20.04.3 LTS (focal).
This reverts commit 7ca2f6d.
Forwarding made it way slower. I'm burning resources, but I want to see if that forwarding caused the huge slowdown, so I reverted the last commit.
Jenkins Console Log. Machine information: Ubuntu 20.04.3 LTS (focal).
And what happened between the last 3 commits (two commits and a revert to the first of the three) that made the speed drop? Is it just terminating runs early because rev/mix/fwd isn't passing? Not sure why there's a speed increase at commit hash e0729e1cdec40e8ec3da60b40b20a2cfc223fc94.
Any comments about these 3 benchmarks? The only thing I changed was adding perfect forwarding (`Container&&`) to the function input on run 2, and it seemed to slow things down. I have some suppositions, but someone with more expertise may have an idea; I need to read about it. More benchmarks would waste a lot more resources, and I'm not sure of the standard deviation of the runtimes.
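For context, the two signatures being compared presumably differ along these lines (both function names here are hypothetical). A forwarding reference only pays off if the body actually reuses an rvalue's storage; if downstream code copies anyway, the extra machinery can make things slower:

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Run 1 style: const lvalue reference; the result is always a fresh copy.
std::vector<double> exp_by_cref(const std::vector<double>& x) {
  std::vector<double> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = std::exp(x[i]);
  return y;
}

// Run 2 style: forwarding reference, perfectly forwarded into the body.
template <typename Container>
auto exp_by_forward(Container&& x) {
  auto y = std::forward<Container>(x);  // moves from rvalues, copies lvalues
  for (auto& v : y) v = std::exp(v);
  return y;
}
```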
I wouldn't give much credence to the model benchmarks unless you know they're using a function you're editing, or they change by a huge amount. I don't even know that they run on the same hardware every time. You'll want to do your own specific benchmarking for a proposal like this.
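A minimal, self-contained pattern for that kind of targeted benchmark with `std::chrono`, sweeping the container size; the sizes and repetition count here are arbitrary:

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  for (std::size_t n : {1000, 32000, 1000000}) {
    std::vector<double> x(n, 0.5), y(n);
    constexpr int reps = 50;  // repeat to average out timer noise
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
      for (std::size_t i = 0; i < n; ++i) y[i] = std::exp(x[i]);
    auto t1 = std::chrono::steady_clock::now();
    double us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
    // Print y[0] so the compiler cannot optimize the loop away entirely.
    std::cout << "n=" << n << "  mean=" << us << " us  (y[0]=" << y[0]
              << ")\n";
  }
}
```

Running this several times per machine would also give a feel for the standard deviation of the runtimes mentioned above.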
Summary
I wrote a class that contains an `operator()` for `exp`, which allows us to use `tbb` to parallelize a for loop. At a lower number of observations the gain from parallelization is marginal, but at a higher number of observations, at around 32,000 elements, `tbb::parallel_for` gives a speedup at 4 threads that sustains as we increase the size of the `Container`.
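A minimal sketch of what that class plus `tbb::parallel_for` over a `tbb::blocked_range` might look like for a `std::vector<double>` input; this is illustrative of the approach, not the exact class in this PR:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cmath>
#include <cstddef>
#include <vector>

// Function object applying exp elementwise over a sub-range; TBB splits
// the blocked_range and invokes operator() on each chunk in parallel.
struct exp_body {
  const std::vector<double>& in_;
  std::vector<double>& out_;
  void operator()(const tbb::blocked_range<std::size_t>& r) const {
    for (std::size_t i = r.begin(); i != r.end(); ++i)
      out_[i] = std::exp(in_[i]);
  }
};

std::vector<double> parallel_exp(const std::vector<double>& x) {
  std::vector<double> y(x.size());
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, x.size()),
                    exp_body{x, y});
  return y;
}
```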
Tests
I tested for numerical accuracy, which checks out. Moreover, I ran the performance benchmarks discussed in the comments above.
Side Effects
Yes. If we kick in threads too early, there's actually a slowdown in computing `exp` on a vector with a lower number of observations. Maybe it would be good to have a default minimum thread count, or have the threads kick in only when the dataset is a certain size. Moreover, this is just one function, so the result may be different when we have a composite function (e.g. a Gaussian). I think this may be advantageous at a lower number of observations, but I have not evaluated this.

What I've done is add a directive that runs the multithreaded code only for `vector`, and calls the original code (though it's copy-pasted into the STAN_THREADS section) accordingly if the function is not threaded, as for `exp`. I'd be open to a quick refactor if we wanted to set it up like OpenCL and have a `threads` directory under `stan/math/prim`.
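A hedged sketch of that dispatch, reusing the illustrative `parallel_exp` from the Summary sketch above. `STAN_THREADS` is the real Stan Math threading macro, but the threshold constant and function names here are hypothetical, and per the comments above the threshold would need per-machine benchmarking:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Serial fallback used below the threshold (and when STAN_THREADS is off).
inline std::vector<double> serial_exp(const std::vector<double>& x) {
  std::vector<double> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = std::exp(x[i]);
  return y;
}

inline std::vector<double> exp_dispatch(const std::vector<double>& x) {
#ifdef STAN_THREADS
  // Illustrative cutoff, roughly where the speedup appeared in this PR's
  // benchmarks; below it, thread startup costs outweigh the parallelism.
  constexpr std::size_t kParallelThreshold = 32000;
  if (x.size() >= kParallelThreshold)
    return parallel_exp(x);  // TBB version sketched under Summary
#endif
  return serial_exp(x);
}
```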
Release notes
?
Checklist
Copyright holder: Andre Zapico, Likely LLC, 2026
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: `./runTests.py test/unit`)
- header checks pass (`make test-headers`)
- dependencies checks pass (`make test-math-dependencies`)
- docs build (`make doxygen`)
- code passes the built-in C++ standards checks (`make cpplint`)

the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested